SUMMARY


Overview 


One way that people can express their confidence in the
accuracy of their own knowledge is to use probabilities (e.g.,
the probability that event A will occur, or that intelligence
report B is true, is .75). One measure of the adequacy of
probability assessments is called calibration. A set of
probability assessments is well calibrated if, in the long
run, the proportion of events that occur or statements that
are true is equal to the assessed probability. Thus, for
example, your assessments of .75 are well calibrated if just
75% of the events in question occur. The goal of the research
project under which the present paper was written is to
explore the psychology of confidence as expressed via
probabilities.

Background 

A large research literature exists on the calibration of
probabilities. However, most of the research has employed
naive participants who have received only very brief instructions
concerning probability. The present report compares the
calibration of participants given only the usual brief
instructions with the calibration of those who were presented with
lengthy instructions that more fully explained probability and
calibration. In addition, the present report explores one
possible cultural source of differences in confidence: gender.

If  it  is  true  that  males  in  our  culture  are  socialized  to  be 
confident  whereas  females  are  trained  to  be  modest,  or  even 
deprecatory,  about  their  abilities,  one  might  expect  that 
females  would  be  less  confident  when  assessing  probabilities. 

Approach

The task was to decide, for each of 200 general-knowledge
questions, which of two possible answers was correct (e.g.,
"The spleen's function is to filter [a] blood, [b] lymph")
and to assess the probability that the chosen answer was indeed
the correct one. About half of the 34 male and 37 female
subjects were given short instructions; the others were given
long instructions.

Findings  and  Implications 

There  was  no  effect  on  calibration  or  confidence  due  to 
instructions.  This  finding  is  consistent  with  previous  research 
suggesting  that  overconfidence  is  more  related  to  cognitive 
difficulties  than  to  unfamiliarity  with  the  response  scale. 

In  addition,  males  and  females  did  not  differ  with  respect 
to  calibration  or  confidence. 

THE EFFECTS OF GENDER AND INSTRUCTIONS


Suppose  you  were  asked,  "Which  is  longer,  the  Suez  Canal  or 
the  Panama  Canal?",  and  further  requested  to  assess  the 
probability  that  your  chosen  answer  is  correct.  Such  assessments 
express  your  confidence  in  your  own  knowledge.  A  burgeoning 
research  area  (reviewed  in  Lichtenstein,  Fischhoff  &  Phillips, 
in  press)  deals  with  the  appropriateness,  or  calibration,  of 
such  expressions  of  confidence.  Probabilities  are  well 
calibrated  if,  over  the  long  run,  one  is  correct  XX%  of  the  times 
that  one  attaches  a  probability  of  .XX  to  an  answer. 

The  overwhelming  finding  of  this  research  is  that,  with 
questions  of  moderate  difficulty,  probability  assessors  are 
usually  overconfident.  For  example,  they  are  typically  correct 
on  only  75%  of  the  occasions  that  they  assign  a  probability  of 
.9.  Such  overconfidence  is  usually  interpreted  as  evidence  that 
people  exaggerate  the  accuracy  of  their  knowledge.  An  alternative 
explanation  is  that  people  simply  do  not  understand  the 
probabilistic  response  scale.  Most  laboratory  research 
documenting  overconfidence  has  used  quite  brief  explanations  of 
that  scale;  seldom  has  calibration  (the  criterion  on  which 
subjects'  performance  is  evaluated)  been  explicitly  described. 

The  present  research  compares  the  calibration  of  people  given 
such  short  instructions  with  the  calibration  of  people  given 
lengthier  instructions  including  an  explicit  explanation  of 
calibration. 

The  longer  instructions  are  similar  to  those  used  in  a 
calibration  training  study  (Lichtenstein  &  Fischhoff,  1980).  In 
that  study,  we  were  surprised  to  find  that  one  third  of  our 
subjects  appeared  to  be  well  calibrated  prior  to  any  training. 
Although we suspected that this prowess reflected something
unusual about these particular subjects (who had been recruited
by personal contact), it could have been due to the more
extensive instructions used.

We also explore in the present study the possibility that
males and females differ in their degree of overconfidence. The
popular wisdom of today is that in our culture males are
socialized to be confident whereas females are trained to be
modest, or even deprecatory, about their abilities. If this is
the case, then females might show less confidence than equally
knowledgeable males. The result would be lessened overconfidence
and improved calibration.


Method 


Subjects 

The  subjects  were  34  males  and  37  females  who  answered  an 
ad  in  the  University  of  Oregon  student  newspaper.  The  present 
task  was  one  of  two  paper-and-pencil  judgment  tasks  performed  in 
group  settings  lasting  an  hour  and  a  half.  Subjects  were  paid 
for  their  participation. 

Items 


The items were 200 general-knowledge questions with two
alternative answers (e.g., "Tricolor is the name of the:
A. Swiss national flag; B. French national flag"; "The spleen's
function is to filter: A. Blood; B. Lymph"). These items had
been used before, as the first set of computer-presented training
items, by Lichtenstein and Fischhoff (1980).

One group of subjects (14 males and 19 females) received the
short instructions; the other group (20 males and 18 females)
received long instructions. The instructions were given in
typed form and read aloud by the experimenter. Subjects then
proceeded at their own pace. For each item they first chose the
answer they believed to be correct and then indicated the
probability (.5 to 1.0) that their choice was correct.

Instructions 

The  short  instructions  were  the  same  as  we  have  used  in 
other  calibration  research  (e.g.,  Lichtenstein  &  Fischhoff,  1977). 
They  read,  in  full: 

This  task  is  composed  of  200  items.  Each  item  is  a 
brief  phrase  followed  by  two  alternatives,  labeled  A  and  B. 
Only  one  of  the  alternatives  is  correct.  Read  each  item 
and  the  two  alternatives  carefully.  First,  decide  which 
alternative  you  think  is  correct,  and  mark  your  answer  on 
the  answer  sheet.  Please  indicate  an  answer,  either  A  or 
B,  even  when  you  are  completely  unsure  which  is  correct. 

Then  in  the  space  provided  to  the  right  of  your  answer  place 
a  probability  value  indicating  how  sure  you  are  that  your 
answer  is  correct.  This  probability  can  be  any  number  from 
.5  to  1.0.  It  can  be  interpreted  as  your  degree  of 
certainty  about  the  correctness  of  your  answer.  For 
example,  if  you  respond  that  the  probability  is  .60,  it 
means  that  you  believe  that  there  are  about  6  chances  out 
of 10 that your answer is correct. A response of 1.00
means that you are absolutely certain that your answer is
correct. A response of .50 means that your best guess
is as likely to be right as wrong. Don't estimate any
probability below .50, because you should always be picking
the alternative that you think is more likely to be correct.
Write  your  probability  in  the  space  provided  on  the  answer 
sheet. 

To  repeat,  this  probability  is  a  measure  of  your 
degree  of  certainty  that  your  chosen  alternative  is  the 
correct  alternative.  It  is  a  number  from  .5  to  1.0  where 
.5  means  complete  uncertainty  and  1.0  means  complete 
certainty. 

Don't worry if you don't know the answers to some
items. We're not so much interested in how much you know
as  we  are  interested  in  how  well  you  can  express  your  own 
feelings  of  knowing  or  not  knowing  in  the  probability 
response. 

The long instructions were three single-spaced typewritten
pages. In addition to the points made in the short instructions,
the long instructions included:

.  .  .  The  more  certain  you  are  that  you  are  right,  the 
larger  the  number  you  should  choose.  But  what  number 
should  you  choose?  This  is  the  nub  of  the  problem.  We 
are  asking  you  to  do  a  very  difficult  task.  We  want  you 
to  examine  your  own  "gut  feelings"  of  certainty  and 
uncertainty  and  translate  those  feelings  into  a  probability 
number. 

A  paragraph  explaining  why  the  probability  response  must  be 
equal  to  or  greater  than  .5  ended  with: 




.  .  .  So  a  probability  of  less  than  .5  suggests  that  you 
goofed  the  first  step,  by  not  choosing  the  alternative 
which  is  most  likely  correct. 

A  paragraph  explaining  that  one  could  use  any  number  of 
digits,  like  .703  or  .832319,  noted: 

.  .  .  but  you  will  find  out  very  soon  that  you  are  not 
capable  of  making  subtle  discriminations  such  as 
deciding  whether  to  give  a  .703  or  a  .704.  You  probably 
won't  want  to  use  numbers  with  a  lot  of  fancy  extra  digits. 
.  .  .  And  how  do  you  decide  whether  to  say  .6  or  .7?  You 
have  to  review  all  the  information  you  have  in  your  head 
about  the  item  in  question,  and  gauge  how  confident  you 
are  about  the  correctness  of  your  choice. 

The  remainder  of  the  instructions  discussed  calibration. 

The  subjects  were  told  their  goal  was: 

...  to  translate  your  own  internal  feelings  of  certainty, 
uncertainty,  and  partial  certainty  into  the  precise 
language  of  probability  numbers.  We  want  you  to  be  well 
calibrated  in  the  same  sense  that  a  thermometer  is  well 
calibrated. When a calibrated instrument says 32°F, it
means  the  same  thing  every  time,  and  it  means  something 
very  specific:  the  temperature  at  which  water  freezes. 

Likewise,  you  should  mean  the  same  thing  every  time 
you  say  .5.  That  means  (a)  I'm  completely  uncertain 
between  the  two  possible  answers  and  (b)  on  average,  I 
have  a  50%  chance  of  getting  this  one  right. 

The responses of two hypothetical subjects were presented in the
instructions. The experimenter amplified the written instructions
at this point, explaining in detail how to read the tables:


Paul

          How Many    Times    Times    Percent
Said      Times       Right    Wrong    Correct

.5           30         15       15        50
.6           10          6        4        60
.7           10          7        3        70
.75          20         15        5        75
.9           10          9        1        90
1.0          20         20        0       100

Totals      100         72       28        72%


Baruch

          How Many    Times    Times    Percent
Said      Times       Right    Wrong    Correct

.5           30         18       12        60
.6           10          8        2        80
.7           10          8        2        80
.75          20         13        7        65
.9           10          9        1        90
1.0          20         16        4        80

Totals      100         72       28        72%

The  instructions  continued: 

.  .  .  [Paul]  is  perfectly  calibrated,  because  his  response 
is  always  equal  to  the  percent  correct.  For  exactly  70%  of 
all the times he said ".7," he was right, and 30% of the
time, he was wrong. He got half of his ".5" responses
right,  and  all  of  his  "1.0"  responses  right,  and  so  on. 

.  .  .  Baruch  was  not  well  calibrated.  For  only  one  class 
of  his  responses  was  he  "right  on":  he  did  get  exactly 
90%  of  his  ".9"  responses  correct.  But  otherwise,  he 
didn't use the probabilities the way he should have. Across
the  30  times  he  said  ".5"  he  got  60%  of  them  right,  instead 
of the desired 50%. This is a kind of underconfidence; he
knew  more  than  he  thought  he  knew.  At  the  other  extreme, 
he  was  wrong  too  often  when  he  said  "1.0" — he  got  only 
80%  right  (to  be  perfectly  calibrated,  you  can  never  be 
wrong when you say "1.0"). This is overconfidence; he
knew  less  than  he  thought  he  knew. 

Notice  that  Paul  and  Baruch  both  got,  overall,  72%  of 
their  answers  correct.  They  both  have  the  same  degree  of 
knowledge.  But  knowledge  is  independent  of  calibration. 

So  don't  worry  about  how  much  you  know  and  don’t  know  in 
this  experiment — we  don't  care  much  about  that. 
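
As an illustrative check (not part of the instructions the subjects received), the following Python sketch recomputes the summary figures for the two hypothetical records. It assumes the weighted squared-difference definition of the calibration score given under Mode of Analysis, applied directly to the six probabilities in the tables without further grouping. The names PAUL, BARUCH, and summarize are hypothetical and chosen only for this example.

```python
# Each row: (stated probability, times used, times right), transcribed
# from the Paul and Baruch tables in the long instructions.
PAUL = [(0.5, 30, 15), (0.6, 10, 6), (0.7, 10, 7),
        (0.75, 20, 15), (0.9, 10, 9), (1.0, 20, 20)]
BARUCH = [(0.5, 30, 18), (0.6, 10, 8), (0.7, 10, 8),
          (0.75, 20, 13), (0.9, 10, 9), (1.0, 20, 16)]


def summarize(rows):
    """Return proportion correct, mean response, overconfidence, calibration."""
    total = sum(used for _, used, _ in rows)
    proportion_correct = sum(right for _, _, right in rows) / total
    mean_response = sum(p * used for p, used, _ in rows) / total
    # Weighted mean squared gap between each stated probability and the
    # proportion correct among the answers given that probability
    # (assumed definition; no regrouping of responses is done here).
    calibration = sum(used * (p - right / used) ** 2
                      for p, used, right in rows) / total
    return {
        "proportion_correct": proportion_correct,
        "mean_response": mean_response,
        "overconfidence": mean_response - proportion_correct,
        "calibration": calibration,
    }


if __name__ == "__main__":
    for name, rows in (("Paul", PAUL), ("Baruch", BARUCH)):
        print(name, {k: round(v, 3) for k, v in summarize(rows).items()})
```

Under these assumptions both records give a proportion correct and a mean response of .72, so neither is overconfident on average. Paul's calibration score is 0 and Baruch's is .018: Baruch's underconfidence at .5, .6, and .7 offsets his overconfidence at .75 and 1.0 in the average, but not in the squared differences.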

Results 


Mode  of  Analysis 


Two-way analyses of variance (Instructions x Gender) were
run on the following measures, calculated separately for each
subject:


(1)  Percentage  of  correct  answers 

(2)  Mean  probabilistic  response 

(3) Overconfidence: the signed difference between the
mean response and the proportion correct. A positive
difference indicates overconfidence; a negative
difference, underconfidence.

(4) Calibration: The mean squared difference between
each probabilistic response and the proportion correct
within that response category, weighted by the number
of responses in each category. For perfect
calibration, this measure would be zero. The largest
calibration score we have ever observed over 200 items
is .115. Since this measure is highly sensitive to
the number of different responses used, all data were
grouped into six response categories before calculating
the measure. These were: .5-.59, .6-.69, . . . ,
.9-.99, and 1.0. For further discussion of this
measure, see Lichtenstein and Fischhoff (1977).

(5)  Proportion  of  times  a  subject  responded  "1.0." 

(6)  Percentage  correct  when  responding  "1.0." 

The  means  of  these  measures  are  shown  in  Table  1. 
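
The six measures can be sketched in Python for a single subject as follows. This is a minimal illustration, not the original analysis code. It assumes that, within each of the six response categories, the "probabilistic response" entering the calibration score is the mean of the responses falling in that category. The names category and subject_measures, and the input format (a list of probability and correctness pairs), are hypothetical.

```python
import math
from collections import defaultdict


def category(p):
    """Map a response to one of six categories: .5-.59, ..., .9-.99, 1.0."""
    # The small epsilon guards against floating-point error at boundaries.
    return min(math.floor(p * 10 + 1e-9), 10)


def subject_measures(answers):
    """answers: list of (probability, was_correct) pairs for one subject."""
    n = len(answers)
    proportion_correct = sum(ok for _, ok in answers) / n
    mean_response = sum(p for p, _ in answers) / n

    # Group responses into the six categories described in measure (4).
    groups = defaultdict(list)
    for p, ok in answers:
        groups[category(p)].append((p, ok))

    # Squared gap between mean response and proportion correct in each
    # category, weighted by the number of responses in that category.
    calibration = 0.0
    for items in groups.values():
        group_mean = sum(p for p, _ in items) / len(items)
        group_correct = sum(ok for _, ok in items) / len(items)
        calibration += len(items) * (group_mean - group_correct) ** 2
    calibration /= n

    certain = groups.get(10, [])  # responses of exactly 1.0
    return {
        "percent_correct": 100 * proportion_correct,
        "mean_response": mean_response,
        "overconfidence": mean_response - proportion_correct,
        "calibration": calibration,
        "proportion_certain": len(certain) / n,
        # Undefined (None) for a subject who never responded 1.0.
        "percent_correct_when_certain": (
            100 * sum(ok for _, ok in certain) / len(certain)
            if certain else None
        ),
    }
```

Because responses are pooled within categories before the squared difference is taken, the score from this sketch can differ from one computed on the raw response values; that dependence on the number of response classes is the reason the text gives for grouping.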

Effect  of  Instructions 


The  instructions  had  no  statistically  significant  effect  on 
any  measure.  These  results  reinforce  our  suspicion  that  the 
unusually  good  calibration  of  some  subjects  in  Lichtenstein  and 
Fischhoff  (1980)  reflects  something  about  those  subjects  rather 
than  something  about  the  (long)  instructions  they  had  received. 

Of  the  71  subjects  in  the  present  experiment,  only  6  had 
calibration  scores  of  less  than  .010  (which  we  consider  to  be 
an  upper  bound  for  calling  someone  "pretty  well  calibrated"). 

The calibration curve (Figure 1) of all subjects combined shows
overconfidence similar to that reported so often in past studies.
It is typical of most of the present subjects, only five of whom
were not overconfident.

Gender  Differences 


Males  had  a  higher  percentage  correct  (66  vs.  62)  and  gave 
higher  probabilistic  responses  (.76  vs.  .72)  than  did  females. 


Table 1

Means for All Performance Measures

                                            Long           Short
                                            Instructions   Instructions   Combined

Percentage of correct answers
    Male                                    65             67             66
    Female                                  62             62             62**

Mean probabilistic response
    Male                                    .76            .77            .76
    Female                                  .74            .71            .72*

Overconfidence
    Male                                    .10            .10            .10
    Female                                  .12            .08            .10

Calibration
    Male                                    .031           .030           .031
    Female                                  .035           .028           .031

Proportion of "1.0" use
    Male                                    .29            .34            .31
    Female                                  .25            .20            .22*

Percentage correct for "1.0" responses
    Male                                    .83            .84            .84
    Female                                  .79            .82            .81**

Number of subjects
    Male                                    20             14             34
    Female                                  18             19             37
    Total                                                                 71

Note: There were no significant differences between long and short
instructions. Significant gender differences are shown as:
*  p < .01
** p < .001


That is, they knew 4% more of the answers to these particular
general-knowledge questions and had, on the average, .04 more
confidence in their answers. As a result, both genders were
equally overconfident. They were also equally well (or poorly)
calibrated, a result that is frequently, but not necessarily,
associated with equivalent overconfidence. One reflection of
males' greater confidence was a greater propensity to use "1.0"
responses (31% vs. 22% of all responses). They were also
correct slightly more often when saying "1.0" (84% vs. 81%), a
result that seems to have no particular significance. Within
each gender, those who used "1.0" more often tended to have
fewer of those responses correct (r = -.42 for males and -.50
for females).


Discussion 

Using  long  instructions  with  explicit  explanations  of 
calibration  did  nothing  to  challenge  the  well-documented 
conclusion  that  people  are  overconfident  and  poorly  calibrated 
for  general-knowledge  questions  of  moderate  difficulty.  These 
results  are  also  consistent  with  other  results  (reviewed  by 
Fischhoff,  in  press)  indicating  that  poor  calibration  is  not  due 
simply  to  a  misunderstanding  of  the  response  scale.  For  example, 
Fischhoff,  Slovic  and  Lichtenstein  (1977)  found  overconfidence 
with  odds  assessments,  as  well  as  with  the  more  usual  probability 
responses.  They  also  found  (as  we  did  here)  that  subjects  chose 
the  wrong  alternative  all  too  often  when  using  the  response  of 
1.0.  Since  people  should  know  what  it  means  to  say  "I'm  sure," 
this  response  cannot  be  accused  of  ambiguity  or  unfamiliarity. 

In  contrast,  Koriat,  Lichtenstein  and  Fischhoff  (1980)  were  able 
to  reduce  overconfidence  without  any  explanation  of  the  response 
scale  beyond  the  short  instructions  used  here.  They  did  so  by 
asking their subjects to list one or more reasons why the
answer they had chosen might be wrong. Thus, overconfidence in
one's knowledge appears to be due more to cognitive difficulties
than to unfamiliarity with probabilistic response modes.

Our finding that males know more answers to trivia questions
than do females has also been reported by Nelson and Narens (1980).
Using  a  recall  task,  they  found  that  male  college  students  more 
often  produced  the  correct  answer  than  did  female  college  students 
for  86%  of  their  300  questions. 

The  slightly  greater  knowledge  of  our  male  subjects  was 
paralleled  by  slightly  greater  confidence,  leaving  the  two 
gender  groups  equally  overconfident.  Although  there  were  no 
overall  differences  in  calibration,  males  used  the  certainty 
response  (1.0)  somewhat  more  appropriately. 

Finally,  we  found  a  hint  of  an  individual  difference  which 
might  be  worth  pursuing:  within  each  gender  group,  the  more 
often  subjects  used  1.0,  the  less  often  they  were  right  on  those 
assessments.  This  finding  might  be  related  to  the  modest 
(r ≈ .30) correlations reported by Hession and McCarthy (Note 1)
and  by  Wright  and  Phillips  (1976)  between  calibration  and  the 
Authoritarianism  (F)  Scale. 